JAIS Initiative: Nile-Chat Models

Model Overview

[Figure: overall benchmark scores]

Nile-Chat is a family of open, instruction-tuned models for the Egyptian dialect of Arabic, developed to handle both scripts commonly used in Egypt: Arabic script and Latin-based Arabizi. As part of the Jais project for standard Arabic and its extensions to dialectal Arabic, Nile-Chat is designed to support natural language generation in a way that reflects the script-diverse nature of Egyptian communication. The models are effective on a variety of tasks, including question answering, translation, and transliteration. Their range of sizes keeps them accessible, from lightweight personal deployments to more powerful setups, enabling broader use of AI technologies for Egyptian Arabic speakers. The family includes two versions:

  • Nile-Chat-4B: A compact 4B parameter model that balances efficiency and fluency, well-suited for generating Egyptian Arabic in both Arabic and Latin scripts.
  • Nile-Chat-12B: A dense 12B parameter model providing high-capacity generation capabilities, ideal for complex, multi-turn interactions and nuanced understanding in both Egyptian scripts.

Paper: Coming Soon!

👥 Our Team

The models were developed by the MBZUAI France Lab, an AI research center in Paris affiliated with the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), which is headquartered in Abu Dhabi.

Usage

Below are some code snippets to help you get started with running the model. First, install the Transformers library with:

pip install -U transformers sentencepiece

Then, copy the snippet from the section below.

Running with the pipeline API

import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="MBZUAI-Paris/Nile-Chat-12B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda" # replace with "mps" to run on a Mac device
)

Q1:

messages = [
    {"role": "user", "content": 'اسمك ايه؟'},
]

outputs = pipe(messages, max_new_tokens=256)
assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
print(assistant_response)

A1:

  • Response:

اسمي نايل-شات، على اسم نهر النيل، اطول نهر في العالم، اللي من زمان كان عامل مهم في تطور مصر، وبيساعد في معيشة الناس وأثر على التراث والثقافة بتاعتنا. وعشان انا موديل لغة، الباحثين بتوع جامعة محمد بن زايد للذكاء الاصطناعي دربوني باستخدام مجموعة من المصادر المفتوحة، وعشان كده سموني نايل-شات. لو عندك اي سؤال او محتاج مساعدة في حاجة، اتفضل اسأل!

(English translation: "My name is Nile-Chat, after the Nile river, the longest river in the world, which has long been an important factor in Egypt's development, supporting people's livelihoods and shaping our heritage and culture. Since I am a language model, the researchers at Mohamed bin Zayed University of Artificial Intelligence trained me using a collection of open sources, which is why they named me Nile-Chat. If you have any question or need help with anything, go ahead and ask!")

Q2:

messages = [
    {"role": "user", "content": 'Esmak eh?'},
]
outputs = pipe(messages, max_new_tokens=256)
assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
print(assistant_response)

A2:

  • Response:

Esmi Nile-Chat, 3ala esm nahr el Nile, atwal nahr fel 3alam w mahmo kter men el hadarat 3ala tool el taree5. Ana masroor gedan bek! A2dar asa3dak ezay el naharda?

(English translation: "My name is Nile-Chat, after the Nile river, the longest river in the world and one that has mattered greatly to civilizations throughout history. I'm very pleased to meet you! How can I help you today?")
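You can also run the model without the pipeline helper. Below is a minimal sketch using the standard Transformers AutoTokenizer / AutoModelForCausalLM chat-template APIs; the generation settings are illustrative rather than values recommended by the authors.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MBZUAI-Paris/Nile-Chat-12B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires accelerate; places layers on the available device(s)
)

messages = [
    {"role": "user", "content": 'اسمك ايه؟'},  # "What's your name?"
]

# Build the prompt with the model's chat template, then generate a reply.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))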

Training Data

Nile-Chat models were trained on diverse datasets focused on the Egyptian dialect: approximately 3.3B tokens during the continual pre-training phase, 1.9M instructions during instruction fine-tuning, and 0.2M samples for DPO, all with a maximum sequence length of 2048 tokens. The data includes:

  • Web documents: A diverse collection of Egyptian web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary.
  • Instruction samples created from publicly available Egyptian Arabic datasets, including translation and transliteration data.
  • English and multilingual pre-training and instruction-tuning datasets translated using Claude 3.5 Sonnet (v2).

The dataset covers Egyptian Arabic in both Arabic and Latin scripts. Our instruction-tuning dataset, Egyptian-SFT-Mixture, is publicly available.
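To inspect the instruction-tuning mixture, a minimal sketch with the datasets library is shown below. The repository id MBZUAI-Paris/Egyptian-SFT-Mixture is an assumption based on the organization name, and the split and column names may differ from the published dataset card.

from datasets import load_dataset

# Repository id and split are assumptions; check the dataset card for the exact names.
sft_mixture = load_dataset("MBZUAI-Paris/Egyptian-SFT-Mixture", split="train")
print(sft_mixture)     # number of rows and column names
print(sft_mixture[0])  # a single instruction/response sample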

Implementation Information

Nile-Chat models are based on Gemma 3 models. They were trained on 8 NVIDIA A100 80 GB GPUs in parallel using FSDP on AWS SageMaker. Training uses the Hugging Face Transformers library with parameter-efficient fine-tuning (LoRA, rank 256) for both continual pre-training and instruction fine-tuning, while full fine-tuning is used for DPO. The continual pre-training is divided into two phases: (i) general pre-training on 2.8B tokens of Egyptian web text and (ii) an annealing phase on 0.5B tokens of high-quality Egyptian text.
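For illustration, here is a minimal sketch of a comparable LoRA setup with the peft library. Only the rank (256) comes from the description above; the target modules, alpha, and dropout are assumptions rather than the authors' exact configuration, and the adapter is attached to the released checkpoint purely to keep the example self-contained (the actual training applied LoRA to the Gemma 3 base model).

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("MBZUAI-Paris/Nile-Chat-12B")

lora_config = LoraConfig(
    r=256,                   # LoRA rank used for continual pre-training and instruction fine-tuning
    lora_alpha=512,          # assumption: not stated in this card
    lora_dropout=0.05,       # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # shows the fraction of parameters the adapter trains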

Evaluation

Nile-Chat models were evaluated on a comprehensive suite of tasks using various datasets and benchmarks to assess their performance across multiple dimensions. These included tasks such as:

  • EgyptianMMLU: An Egyptian version of the ArabicMMLU and MMLU benchmarks.
  • EgyptianHellaSwag: An Egyptian version of HellaSwag (in both Arabic and Latin scripts).
  • Belebele Arz_Arab: Belebele is a multiple-choice machine reading comprehension dataset published by Facebook spanning 122 language variants. Evaluation is done on the Arz_Arab portion of Belebele, which corresponds to Egyptian Arabic.
  • Translation: Covers four directions across three languages: Egyptian Arabic (Arabic script), MSA, and English.
  • Transliteration: Transforming a sentence from Egyptian Arabic (written in Arabic script) to Arabizi (written in Latin script) and vice versa.
  • EgyptianPIQA: An Egyptian version of the PIQA benchmark (in both Arabic and Latin scripts).
  • EgyptianWinoGrande: An Egyptian version of the WinoGrande benchmark (in both Arabic and Latin scripts).
  • EgyptianRACE: An Egyptian version of the RACE benchmark (in both Arabic and Latin scripts).
  • EgyptianOpenBookQA: An Egyptian version of the OpenBookQA benchmark.
  • EgyptianAlpacaEval: An Egyptian adaptation of AlpacaEval to assess LLM instruction-following and cultural alignment.

The models were compared against a collection of existing open-source Arabic models to gauge their effectiveness, with a particular focus on performance in Egyptian Arabic. All scores are based on zero-shot performance. The prompts are written mainly in Egyptian Arabic. We used the Language Model Evaluation Harness to conduct these evaluations. All evaluations apply the chat template, except for EgyptianWinoGrande.
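As a rough sketch of this zero-shot setup, the snippet below calls the Language Model Evaluation Harness (lm-eval) Python API. The task names are hypothetical placeholders, since the harness task definitions for the Egyptian benchmarks are not listed in this card, and apply_chat_template requires a recent lm-eval release.

import lm_eval

# Task names are placeholders for the Egyptian benchmark tasks.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=MBZUAI-Paris/Nile-Chat-12B,dtype=bfloat16",
    tasks=["egyptian_mmlu", "egyptian_hellaswag"],  # hypothetical task names
    num_fewshot=0,                # zero-shot, as in the reported results
    apply_chat_template=True,     # chat template applied to all tasks except EgyptianWinoGrande
    batch_size=8,
)
print(results["results"])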

Benchmarks:

Arabic Script Benchmarks

| Model | Average | EgyptianMMLU | Belebele Arz | EgyptianHellaSwag | EgyptianPIQA | EgyptianWinoGrande | EgyptianOpenBookQA | EgyptianRACE High | EgyptianRACE Middle | EgyptianAlpacaEval |
|---|---|---|---|---|---|---|---|---|---|---|
| gemma-3-4b-it | 48.76 | 46.08 | 38.56 | 42.56 | 60.32 | 56.49 | 35.79 | 33.68 | 40.06 | 85.30 |
| jais-family-6p7b-chat | 46.64 | 42.60 | 57.33 | 49.18 | 62.23 | 57.04 | 33.33 | 34.72 | 37.50 | 45.86 |
| jais-adapted-7b-chat | 42.18 | 40.96 | 55.67 | 40.85 | 56.50 | 54.35 | 32.89 | 34.62 | 42.33 | 21.45 |
| Qwen2.5-7B-Instruct | 49.40 | 45.74 | 64.22 | 45.47 | 58.02 | 56.41 | 38.70 | 35.45 | 41.76 | 58.80 |
| ALLaM-7B-Instruct-preview | 56.40 | 60.08 | 67.67 | 57.29 | 66.10 | 62.18 | 40.04 | 39.50 | 45.17 | 69.55 |
| c4ai-command-r7b-arabic-02-2025 | 53.36 | 50.97 | 70.67 | 50.39 | 61.84 | 57.20 | 36.91 | 41.89 | 46.02 | 73.36 |
| Llama-3.1-8B-Instruct | 46.31 | 42.88 | 55.89 | 43.10 | 57.97 | 54.27 | 35.57 | 34.41 | 40.34 | 52.35 |
| AceGPT-v2-8b-chat | 58.33 | 55.25 | 73.33 | 53.14 | 62.50 | 58.39 | 39.82 | 41.06 | 47.16 | 93.33 |
| gemma-2-9b-it | 53.17 | 50.72 | 49.44 | 49.53 | 61.35 | 61.79 | 35.79 | 40.23 | 48.01 | 81.66 |
| gemma-3-12b-it | 59.70 | 61.55 | 77.00 | 49.49 | 64.96 | 63.53 | 38.03 | 41.27 | 48.86 | 92.61 |
| jais-family-13b-chat | 49.81 | 44.85 | 66.33 | 52.99 | 64.85 | 57.91 | 36.91 | 33.26 | 38.64 | 52.52 |
| jais-adapted-13b-chat | 49.80 | 50.03 | 65.33 | 47.53 | 61.30 | 56.72 | 37.14 | 35.45 | 41.76 | 52.91 |
| Qwen2.5-14B-Instruct | 57.34 | 60.81 | 72.33 | 55.84 | 63.97 | 59.97 | 38.26 | 43.25 | 50.28 | 71.35 |
| Nile-Chat-4B | 57.85 | 50.25 | 68.56 | 55.92 | 67.30 | 61.87 | 40.94 | 42.10 | 46.02 | 87.65 |
| Nile-Chat-12B | 64.11 | 62.59 | 79.44 | 64.04 | 70.69 | 63.53 | 42.06 | 48.02 | 53.13 | 93.50 |

Latin Script Benchmarks

| Model | Average | EgyptianHellaSwag | EgyptianPIQA | EgyptianWinoGrande | EgyptianRACE High | EgyptianRACE Middle |
|---|---|---|---|---|---|---|
| gemma-3-4b-it | 36.93 | 30.90 | 52.76 | 48.57 | 25.47 | 26.94 |
| jais-family-6p7b-chat | 37.58 | 30.27 | 53.25 | 52.14 | 24.18 | 28.06 |
| jais-adapted-7b-chat | 37.06 | 30.81 | 51.67 | 50.40 | 24.38 | 28.06 |
| Qwen2.5-7B-Instruct | 36.87 | 30.51 | 51.88 | 50.95 | 24.88 | 26.11 |
| ALLaM-7B-Instruct-preview | 38.58 | 32.17 | 53.09 | 50.63 | 25.07 | 31.94 |
| c4ai-command-r7b-arabic-02-2025 | 37.38 | 30.88 | 52.32 | 51.43 | 25.07 | 27.22 |
| Llama-3.1-8B-Instruct | 37.62 | 31.77 | 53.30 | 50.24 | 24.48 | 28.33 |
| AceGPT-v2-8b-chat | 38.77 | 33.16 | 53.80 | 50.24 | 26.07 | 30.56 |
| gemma-2-9b-it | 38.70 | 33.75 | 53.69 | 50.79 | 26.66 | 28.61 |
| gemma-3-12b-it | 41.63 | 37.52 | 53.14 | 51.19 | 31.02 | 35.28 |
| jais-family-13b-chat | 36.96 | 30.46 | 53.09 | 48.18 | 25.28 | 27.78 |
| jais-adapted-13b-chat | 36.98 | 31.14 | 52.87 | 50.79 | 23.98 | 26.11 |
| Qwen2.5-14B-Instruct | 39.48 | 33.49 | 52.87 | 53.41 | 27.35 | 30.28 |
| Nile-Chat-4B | 51.38 | 50.55 | 65.32 | 60.62 | 37.36 | 43.06 |
| Nile-Chat-12B | 53.88 | 53.71 | 65.10 | 59.98 | 41.72 | 48.89 |

Translation and Transliteration Tasks:

| Model | Long Trans. BLEU | Long Trans. chrF | Long Trans. BERTScore | Short Trans. BLEU | Short Trans. chrF | Short Trans. BERTScore | Translit. BLEU | Translit. chrF | Translit. BERTScore |
|---|---|---|---|---|---|---|---|---|---|
| gemma-3-4b-it | 20.67 | 44.75 | 73.03 | 04.76 | 31.15 | 52.98 | 01.44 | 20.36 | 47.54 |
| jais-family-6p7b-chat | 12.71 | 36.53 | 68.07 | 08.73 | 31.52 | 56.78 | 00.70 | 10.64 | 42.51 |
| jais-adapted-7b-chat | 10.61 | 27.56 | 63.48 | 09.19 | 24.85 | 53.52 | 01.11 | 06.14 | 40.45 |
| Qwen2.5-7B-Instruct | 19.89 | 44.80 | 73.64 | 11.34 | 36.31 | 54.96 | 02.74 | 20.63 | 49.32 |
| ALLaM-7B-Instruct-preview | 26.57 | 52.59 | 78.34 | 25.20 | 48.12 | 65.97 | 02.10 | 18.92 | 49.42 |
| c4ai-command-r7b-arabic-02-2025 | 25.18 | 50.26 | 77.97 | 23.30 | 45.34 | 65.20 | 03.52 | 24.57 | 50.49 |
| Llama-3.1-8B-Instruct | 12.90 | 32.58 | 68.76 | 09.06 | 28.56 | 54.19 | 03.26 | 17.55 | 48.71 |
| AceGPT-v2-8b-chat | 24.59 | 49.39 | 77.57 | 22.47 | 44.97 | 66.30 | 04.80 | 23.52 | 49.33 |
| gemma-2-9b-it | 23.09 | 46.98 | 75.42 | 11.73 | 39.00 | 60.42 | 02.68 | 24.28 | 48.26 |
| gemma-3-12b-it | 22.90 | 45.97 | 73.46 | 05.24 | 32.82 | 54.34 | 02.77 | 26.16 | 50.47 |
| jais-family-13b-chat | 10.41 | 31.98 | 64.15 | 08.64 | 30.10 | 57.00 | 00.84 | 11.35 | 44.71 |
| jais-adapted-13b-chat | 15.53 | 41.48 | 70.86 | 15.96 | 38.81 | 63.52 | 01.00 | 13.33 | 46.08 |
| Qwen2.5-14B-Instruct | 21.71 | 45.55 | 73.36 | 09.26 | 34.21 | 53.89 | 04.07 | 25.83 | 51.41 |
| Nile-Chat-4B | 37.49 | 58.40 | 84.30 | 30.35 | 52.01 | 74.07 | 51.46 | 80.44 | 89.59 |
| Nile-Chat-12B | 40.53 | 60.61 | 85.45 | 32.20 | 53.53 | 74.72 | 52.21 | 80.97 | 89.71 |
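The translation and transliteration results above are reported with BLEU, chrF, and BERTScore. The snippet below is a minimal sketch (not the authors' scoring code) of how these metrics can be computed with the sacrebleu and bert-score packages; the example sentences and the BERTScore language setting are illustrative assumptions.

import sacrebleu
from bert_score import score as bert_score

# Placeholder system outputs and references; in practice these come from the test set.
hypotheses = ["placeholder system output"]
references = ["placeholder reference translation"]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
_, _, f1 = bert_score(hypotheses, references, lang="ar")  # language code is an assumption

print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}  BERTScore F1: {100 * f1.mean().item():.2f}")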

Usage and Limitations

These models have certain limitations that users should be aware of.

Intended Usage

Open Large Language Models (LLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.

  • Content Creation and Communication

    • Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
    • Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
    • Text Summarization: Generate concise summaries of a text corpus, research papers, or reports.
  • Research and Education

    • Natural Language Processing (NLP) Research: These models can serve as a foundation for researchers to experiment with NLP techniques, develop algorithms, and contribute to the advancement of the field.
    • Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
    • Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.

Limitations

  • Training Data

    • The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
    • The scope of the training dataset determines the subject areas the model can handle effectively.
  • Context and Task Complexity

    • LLMs perform better on tasks framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
    • A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
  • Language Ambiguity and Nuance

    • Natural language is inherently complex. LLMs might struggle to grasp subtle nuances, sarcasm, or figurative language.
  • Factual Accuracy

    • LLMs generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
  • Common Sense

    • LLMs rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations.
  • Ethical Considerations and Risks

    The development of large language models (LLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:

    • Bias and Fairness
      • LLMs trained on large-scale, real-world text data can reflect socio-cultural biases embedded in the training material.
    • Misinformation and Misuse
      • LLMs can be misused to generate text that is false, misleading, or harmful.
      • Guidelines for responsible use are provided with the model; see the Responsible Generative AI Toolkit.
    • Transparency and Accountability
      • This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes.
      • A responsibly developed open model offers the opportunity to share innovation by making LLM technology accessible to developers and researchers across the AI ecosystem.

    Risks identified and mitigations:

    • Perpetuation of biases: Continuous monitoring (using evaluation metrics and human review) and the exploration of de-biasing techniques are encouraged during model training, fine-tuning, and other use cases.
    • Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
    • Privacy violations: Models were trained on data filtered for removal of PII (Personally Identifiable Information). Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.